Apparatus and method for text segmentation based on coherent units

ABSTRACT

The invention provides a text segmentation apparatus comprising means for analyzing an electronic text to determine likelihood of segmentation point for each of sentence ends in the text based on a coherent unit and means for segmenting the text into text segments based on the likelihood of segmentation point. The apparatus is programmed to segment the text segment at the position having the best likelihood of segmentation point within the text segment when the size of any of the segmented text segments exceeds a threshold value to be determined based on the specified text segmentation size. Particularly, the apparatus determines the similarity between the text parts contained in a pair of windows to be set up on the left and right sides of each sentence end position in the text so as to obtain similarity curves. Then, the apparatus determines the likelihood of segmentation point for each sentence end point based on the obtained similarity curves. The apparatus segments the text at the point having the best likelihood of segmentation point and further segments it at the point of the second best likelihood of segmentation point, and so on, until the size of all of the text segments becomes approximately equal to the specified segment size.

[0001] This application claims priority from PCT Patent Application No. PCT/US01/30734, filed Oct. 02, 2001 and Japanese Patent Application No. 2000-302321, filed Oct. 02, 2000.

FIELD OF THE INVENTION

[0002] The invention relates to a text segmentation technique and, more specifically, to a text segmentation technique for segmenting texts based on coherent units.

BACKGROUND OF THE INVENTION

[0003] After a text has been retrieved through a text search process, a user must make a further search to find the required part of the displayed text if the retrieved text is still bulky enough to contain many topics. In such a case, if the user searches through text segments that have been segmented beforehand based on topics, the user will be able to display the desired text segment immediately, and a further search for the required part may become unnecessary. Therefore, if the text is segmented based on topics, it becomes easier to perform various text processing applications.

[0004] Several text segmentation methods are disclosed in, for example, Laid Open Japanese Patent Application No. H11-242684, Laid Open Japanese Patent Application No. 2000-235574 and Laid Open Japanese Patent Application No. H10-72724. Laid Open Japanese Patent Application No. H11-242684 proposes a text segmentation apparatus wherein texts are handled in terms of not only association between adjacent sentences but also global sentence association. Laid Open Japanese Patent Application No. 2000-235574 proposes a method for obtaining segmentation points based on a square matrix whose elements include relativeness between the paragraphs, the text being segmented in accordance with a paragraph format and the like. Laid Open Japanese Patent Application No. H10-72724 proposes a method comprising the steps of determining the relativeness at each position based on a plurality of windows, determining the border of the topics for each layer and integrating those borders to identify the topic border.

[0005] It is possible to segment a text in terms of topics using the above-referenced methods. However, those methods do not take the size of the text into consideration. In particular, when using equipment such as mobile telephones and PDA devices that have limited resources, e.g., a small display, users may need an extra operation, for example scrolling, to display the segmented text segments. In addition, the size of the text segments may exceed the storage limit of such equipment. Accordingly, a text segment produced by one of the above-referenced conventional text segmentation methods is not necessarily a desirable segmentation unit for users and/or terminal devices.

[0006] Therefore, there is a need for a text segmentation method for segmenting a text in accordance with coherent units as well as a specified text segment size. There is also a need for a technique for providing a group of text segments which users can easily read through even if they are displayed on the small display screens of mobile telephones and/or PDA devices.

SUMMARY OF THE INVENTION

[0007] A text segmentation apparatus provided in accordance with one aspect of the invention comprises means for analyzing an electronic text to determine likelihood of segmentation point for each of sentence ends in said text based on a coherent unit and means for segmenting said text into text segments based on said likelihood of segmentation point and a specified text segmentation size.

[0008] A text segmentation apparatus provided in accordance with another aspect of the invention comprises means for analyzing an electronic text to determine likelihood of segmentation point for each of sentence ends in said text based on a coherent unit and means for segmenting said text into text segments based on said likelihood of segmentation point, wherein when the size of any of said segmented text segments exceeds a threshold value to be determined based on a specified text segmentation size, said text segmentation apparatus is programmed to segment said text segment at the position having best likelihood of segmentation point within said text segment.

[0009] In accordance with one embodiment of the invention, the text is segmented into a group of text segments, each having a size approximately equal to the specified one. In order to achieve this, the inventive apparatus first determines the similarity between the text parts contained in a pair of windows set up on the left and right sides of each sentence end position in the text so as to obtain similarity curves. Then, the apparatus determines the likelihood of segmentation point for each sentence end point based on the obtained similarity curves. The apparatus segments the text at the point having the best likelihood of segmentation point and further segments it at the point of the second best likelihood of segmentation point, and so on, until the size of all of the text segments becomes approximately equal to the specified segment size.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] FIG. 1 is a block diagram illustrating a text segmentation apparatus in accordance with one embodiment of the invention;

[0011] FIG. 2 is a schematic flow chart illustrating a first half of the text segmentation algorithm;

[0012] FIG. 3 is a schematic flow chart illustrating a second half of the text segmentation algorithm;

[0013] FIG. 4 is a schematic flow chart illustrating an algorithm for performing an association process between text segments;

[0014] FIG. 5 is a graphic chart illustrating a similarity curve when the window size is set to 480 words;

[0015] FIG. 6 is a graphic chart illustrating a similarity curve when the window size is set to 240 words;

[0016] FIG. 7 is a graphic chart illustrating a similarity curve when the window size is set to 120 words;

[0017] FIG. 8 is a graphic chart illustrating a similarity curve when the window size is set to 60 words; and

[0018] FIG. 9 is a graphic chart illustrating a curve for the likelihood of segmentation points.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0019] One embodiment in accordance with the invention will be described in detail in the following with reference to the attached drawings. FIG. 1 illustrates the functional blocks showing the structure of the system in accordance with one embodiment of the invention. In terms of hardware structure, the system of this embodiment comprises a general purpose computer, workstation or personal computer. The invention can be implemented through the execution of a computer program on the general purpose computer. Each of the blocks shown in FIG. 1 represents the respective function embodied by such computer program.

[0020] A morphological analysis block 2 receives an electronic text 1 as an object to be segmented, extracts words from the text and appends part-of-speech information to each of the extracted words. A window size setup block 3 sets up a window size to be used for measuring the similarity between the adjacent sentences contained in the concerned text. The window size is defined as a predetermined length in the left and right directions from a sentence end position. A similarity measurement block 4 measures, at each sentence end position, the similarity between the text portions contained in the left and right windows that have been set up by the window size setup block 3 and generates the corresponding similarity curve.

[0021] A determination block 5 determines the likelihood of segmentation point at each sentence end position based on the similarity curve generated by the similarity measurement block 4. A segmentation point determination block 6 uses the likelihood of segmentation point determined in the determination block 5 to select as a segmentation point the position having the best likelihood of segmentation point within the largest text segment. At the starting point of the process, when the text 1 is not yet segmented, the entire text 1 is regarded as the largest text segment.

[0022] A size comparison block 11 compares the size of the candidate text segment selected by the segmentation point determination block 6 with a threshold size value determined based on the text segment size specified by the output equipment. If the size of the candidate text segment is larger than the threshold size, the position having the best likelihood of segmentation point in that candidate text segment may be selected as a segmentation point. A text segment generation block 7 collects the candidate text segments obtained through the previous blocks to generate a set of text segments. Until the size of all of the text segments within the set becomes smaller than the specified size, the process may return to the segmentation point determination block 6 and the size comparison block 11 to repeat their processes.

[0023] A relativeness determination block 8 determines the similarity between adjacent segments generated by the text segment generation block 7 and performs an association process upon those text segments using that similarity. A link generation block 12 generates a link between the text segments that are highly associated in terms of contents, based on the determination result of the relativeness determination block 8. The text segments thus generated may be transmitted to the requesting terminal equipment, e.g., a PDA or a mobile telephone.

[0024] In one embodiment, a text segmentation apparatus in accordance with the invention may be used in an Internet environment. For example, a user may use a PDA to access a web site via the Internet, search for data and display the acquired data on a PDA browser. In this case, the web site may utilize the inventive text segmentation apparatus to segment the text to be transmitted to the PDA into text segments so that the size of the text segments matches the display screen of the PDA. The text segments may be converted to the HTML format, and appropriate hyperlinks pointing to the associated text segments may be embedded before transmission to the PDA. Because the size of the text segments already matches the display screen of that PDA, the user can jump to the next text segment, or to another text segment having a higher relativeness in terms of content, by clicking a button, so the user can comfortably view the text even on a small display screen.

[0025] FIG. 2 and FIG. 3 illustrate flowcharts of the text segmentation algorithm. FIG. 4 illustrates a flowchart of the association algorithm among the text segments. First, with reference to FIG. 2, the process receives an electronic text D containing N sentences and M words as well as the optimum text segment size S (step 202). The optimum text segment size is defined based on the number of characters specified by the user or the number of characters displayable on the PDA or the mobile telephone. For example, the optimum text segment size may be 100 characters when the terminal device can display 100 characters.

[0026] In the next step 203, the process performs a morphological analysis on the input electronic text D to extract the words and attach part-of-speech information to each of the words. In step 204, the process extracts from those words any noun that appears more than twice in the text D as a term t_(i) and generates a term list T = (t₁, t₂, t₃, ..., t_(n)).
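The term extraction of step 204 can be illustrated with a minimal Python sketch. It assumes the morphological analysis has already produced (word, part-of-speech) pairs; the function name build_term_list, the "NOUN" tag, the lowercasing of words and the reading of "more than twice" as at least three occurrences are all assumptions made for illustration and are not specified in the text.

```python
from collections import Counter


def build_term_list(tagged_words, min_count=3):
    """Return nouns that occur at least `min_count` times, in first-seen order.

    `tagged_words` is a list of (word, pos) pairs; the "NOUN" tag and the
    lowercasing are illustrative assumptions, not part of the specification.
    """
    counts = Counter(w.lower() for w, pos in tagged_words if pos == "NOUN")
    seen, terms = set(), []
    for word, pos in tagged_words:
        key = word.lower()
        if pos == "NOUN" and counts[key] >= min_count and key not in seen:
            seen.add(key)
            terms.append(key)
    return terms


# Hypothetical tagged input standing in for the output of step 203.
tagged = [("Linux", "NOUN"), ("is", "VERB"), ("software", "NOUN"),
          ("Linux", "NOUN"), ("lab", "NOUN"), ("Linux", "NOUN"),
          ("software", "NOUN"), ("software", "NOUN"), ("lab", "NOUN")]
print(build_term_list(tagged))
```

In this toy input "lab" occurs only twice, so it is not kept as a term.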

[0027] The process continues to set up the window size B in step 205. The window size B may be initially set to, for example, a fifth (⅕) of the number M of the words contained in the concerned text. Then, in step 206, the process sets a pair of windows, each having the window size B, on the left and right sides of the respective end positions of the sentences contained in the concerned text and obtains a vector W = (w_(t1), w_(t2), w_(t3), ..., w_(tn)) using the above-described terms as its elements from the text portions contained in each of the left and right windows, where w_(t1) represents the occurrence frequency of the term t₁ in the text contained in the window. In step 207, the process determines the cosine measure sim(b_(l), b_(r)) as the similarity at that position from the two vectors thus obtained. The cosine measure may be obtained by the following equation (1):

$$\mathrm{sim}(b_{l}, b_{r}) = \frac{W_{bl} \cdot W_{br}}{\sqrt{W_{bl}^{2} \times W_{br}^{2}}} \qquad (1)$$

[0028] where b_(l) and b_(r) represent the text portions contained in the left and right windows respectively, W_(bl) and W_(br) are the vectors representing the occurrence frequency of the terms appearing respectively in the left and right windows, and W² denotes the sum of the squared elements of a vector W. As the number of terms appearing in both the left and right windows increases, the similarity given by equation (1) becomes higher (up to a maximum of 1). If there is no common term, the similarity becomes zero. That is, a larger similarity value indicates a higher probability that the left and right windows share common topics, whereas a smaller value indicates a high probability of a topic boundary.
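As a rough illustration of steps 206 and 207, the following Python sketch computes the similarity at one sentence end position. It assumes equation (1) is read as the standard cosine measure (dot product divided by the product of the vector norms), which is consistent with the stated maximum of 1; the function names, the word-index representation of a sentence end and the toy data are hypothetical.

```python
import math
from collections import Counter


def term_vector(words, terms):
    """Occurrence frequency of each term in `terms` within `words`."""
    counts = Counter(words)
    return [counts[t] for t in terms]


def cosine(u, v):
    """Standard cosine measure; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_at(words, end_index, window, terms):
    """Similarity of the `window` words left and right of a sentence end.

    `end_index` is the index of the first word after the sentence end
    (an assumed representation of the sentence end position).
    """
    left = words[max(0, end_index - window):end_index]
    right = words[end_index:end_index + window]
    return cosine(term_vector(left, terms), term_vector(right, terms))


# Hypothetical word sequence and term list.
words = "linux lab linux software lab linux windows microsoft windows".split()
terms = ["linux", "lab", "windows"]
print(similarity_at(words, 3, 3, terms))  # windows share "linux" and "lab"
print(similarity_at(words, 6, 3, terms))  # windows share no term
```

Evaluating similarity_at at every sentence end for a fixed window size yields one similarity curve.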

[0029] The suffix i shown in FIG. 2 represents the number of the sentence contained in the text, where the numbers 1, 2, 3, ..., N are given to the sentences starting from the top sentence of the concerned text. Accordingly, the similarity determination for the end position of each sentence is performed by incrementing i until it reaches N, that is, until it is determined NO in step 209. Thus, the similarity curve for the concerned text may be obtained. FIG. 5 through FIG. 8 illustrate similarity curves for the following sample input text:

TABLE 1
The community of mostly volunteer programmers that has built Linux into a formidable operating system is getting some help from computer industry giants. International Business Machines Inc., Intel Corp., Hewlett-Packard Co. and NEC Corp. are announcing Wednesday that they will create a laboratory with an investment of several million dollars where programmers can test Linux software on the large computer systems that are common in the corporate world. The lab is expected to open by the end of the year near Portland, Ore. Linux is an “open source” operating system that anyone can modify, as long as the modifications are made available for free on the Internet. It has a devoted following among programmers, who collaborate on software projects over the Web. These software engineers can usually only test software on their own desktop computers, part of the reason Linux is now rarely used on larger computers. “The Open Source Development Lab will help fulfill a need that individual Linux and open source developers often have: access to high-end enterprise hardware,” said Brian Behlendorf, creator of the open source Web server software Apache. Irving Wladawsky-Berger, the head of IBM's Linux group, said the lab would help companies run hardware from different vendors together, as well as let run “clusters” of computers working as one. The four main sponsors said they will contribute several millions of dollars to the project. The lab is also backed by smaller companies that specialize in Linux products, like Red Hat Inc., Turbolinux Inc., Linuxcare Inc. and VA Linux Systems Inc., as well as Dell Computer Corp. and Silicon Graphics Inc. The founding companies said the lab will be run by a nonprofit organization that will select the software projects that gain access to the lab in an “open, neutral process.” Linux is seen as an alternative to proprietary operating systems like Microsoft's Windows and Apple OS. Its backers say the publicly available source code, or software blueprint, makes it more flexible and reliable. Analyst Bill Claybrook at Aberdeen Group said the project sponsors are backing Linux because it gives them a chance to influence an operating system for their computers. “These companies see that they can play a much more important role in developing Linux than they can in, let's say Windows, because Microsoft pretty much decides what to put in Windows,” he said.

[0030] In FIG. 5 through FIG. 8, the horizontal axes represent the sentence end positions, the vertical axes represent the corresponding similarity, and the window sizes represent the number of words contained respectively in the left and right windows.

[0031] The process continues to obtain the likelihood of segmentation point f(c) for each sentence end position c based on those similarity curves. In step 209, when i=N, in other words, once the process has obtained the similarity curve under the condition of B=M/5 for all the sentences contained in the concerned text, the process sets i=1 (step 212) and determines the likelihood of segmentation point for the end position of the first sentence (step 213). This determination is repeated by incrementing i (step 216) until it reaches N. The likelihood of segmentation point f(c) may be determined from the following equations:

$$f(c) = \alpha \cdot \mathit{fs}(c) + \beta \cdot \mathit{fg}(c) \qquad (2)$$

$$\mathit{fs}(c) = 1 - s(c) \qquad (3)$$

$$\mathit{fg}(c) = \frac{\left( s(c-) - s(c) \right) + \left( s(c+) - s(c) \right)}{2} \qquad (4)$$

[0032] where s(c) represents the similarity at the end position c of each sentence, s(c−) represents the similarity at the end position of the sentence immediately before the end position c, s(c+) represents the similarity at the end position of the sentence immediately after the end position c, and alpha (α) and beta (β) are parameters to be appropriately determined through experiment.

[0033] The value of the likelihood of segmentation point of equation (2) becomes larger when the corresponding similarity is at a minimal point or the magnitude of the transition between adjacent similarities is large, whereas it becomes smaller when the corresponding similarity is large or the magnitude of the transition between adjacent similarities is small.
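Equations (2) through (4) can be sketched in Python as follows. The values α = β = 0.5 are placeholders only, since the text leaves both parameters to experiment, and reusing s(c) in place of a missing neighbor at the first and last sentence ends is an assumption not stated in the text. The function name local_likelihood and the sample curve are hypothetical.

```python
def local_likelihood(sims, alpha=0.5, beta=0.5):
    """Apply equations (2)-(4) to a similarity curve.

    `sims[c]` is the similarity s(c) at sentence end c. At the two ends of
    the curve the missing neighbor is replaced by s(c) itself (assumption).
    """
    result = []
    for c, s in enumerate(sims):
        prev_s = sims[c - 1] if c > 0 else s
        next_s = sims[c + 1] if c + 1 < len(sims) else s
        fs = 1.0 - s                                # equation (3)
        fg = ((prev_s - s) + (next_s - s)) / 2.0    # equation (4)
        result.append(alpha * fs + beta * fg)       # equation (2)
    return result


# Hypothetical similarity curve with a clear dip at index 2.
sims = [0.8, 0.7, 0.2, 0.6, 0.9]
f = local_likelihood(sims)
print(f.index(max(f)))  # position of the most likely segmentation point
```

The dip in the curve is both a low similarity value and a local minimum, so both terms of equation (2) favor it.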

[0034] When i reaches N (that is, it is determined NO in step 215), the process sets the window size B to a half (½) of the initial size and returns to step 206 to repeat the subsequent steps. After those steps complete, the process again halves the current window size B and repeats the subsequent steps. These processes are repeated until j reaches the total number L of the similarity curves, at which point it is determined NO in step 217.
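The sequence of window sizes used for the L similarity curves can be sketched as below. The use of integer division and the lower bound of one word are assumptions; the word count of 2400 is a hypothetical value chosen only because it reproduces the window sizes named for FIG. 5 through FIG. 8.

```python
def window_sizes(total_words, num_curves):
    """Window sizes starting at M/5 and halved for each further curve."""
    sizes = []
    size = max(1, total_words // 5)   # initial window size B = M/5
    for _ in range(num_curves):
        sizes.append(size)
        size = max(1, size // 2)      # halve B for the next similarity curve
    return sizes


print(window_sizes(2400, 4))  # [480, 240, 120, 60]
```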

[0035] Then, using the L likelihoods of segmentation point f(c) obtained for the respective window sizes, the overall likelihood of segmentation point F(c) for the input text D may be obtained as follows:

$$F(c) = \sum_{j=1}^{L} \gamma_{j} \cdot \log f_{j}(c) \qquad (5)$$

[0036] where f_(j)(c) represents the likelihood of segmentation point obtained from the jth similarity curve and γ_(j) is a weighting factor for each similarity curve. As for the value of γ_(j), for example, 1 is given to the likelihood of segmentation point for the largest window size, ½ for the second largest, ¼ for the third, and so on. The text segmentation process described below for this embodiment is performed based on the likelihood curve of segmentation point obtained by equation (5). FIG. 9 shows such a likelihood curve of segmentation point.
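A minimal Python sketch of equation (5) with the example weights 1, ½, ¼ is given below. Because the logarithm is undefined for non-positive values and f(c) from equation (2) is not guaranteed to be positive, the sketch clamps each value to a small constant eps; that clamping, the function name overall_likelihood and the sample curves are assumptions for illustration.

```python
import math


def overall_likelihood(curves, eps=1e-6):
    """Combine per-window likelihood curves into F(c) per equation (5).

    `curves[j][c]` is f_j(c), ordered from the largest window to the
    smallest. Values are clamped to `eps` before the log (assumption).
    """
    total = [0.0] * len(curves[0])
    for j, curve in enumerate(curves):
        gamma = 0.5 ** j  # weights 1, 1/2, 1/4, ...
        for c, value in enumerate(curve):
            total[c] += gamma * math.log(max(value, eps))
    return total


# Two hypothetical likelihood curves over three sentence ends.
curves = [[0.1, 0.6, 0.2], [0.2, 0.5, 0.4]]
F = overall_likelihood(curves)
print(F.index(max(F)))  # sentence end with the best overall likelihood
```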

[0037] Now, with reference to FIG. 3, the process starts at step 301 where the entire text before segmentation is represented as a text segment R₀. In step 302, the process selects the segment R_(i) having the largest size from the text segment set R. The text segment set R initially includes only the text segment R₀ comprising the entire text.

[0038] The process continues to step 303 to compare the size of the selected text segment R_(i) with the segment size threshold Th_(size), which is determined based on the specified size, that is, the optimum segment size S. For example, if the segment size threshold Th_(size) is set to 1.1 times the optimum segment size S, a text segment having a size within 110% of the optimum segment size may be accepted.

[0039] If the size of the text segment R_(i) exceeds the threshold Th_(size), the end position c of the sentence that has the best likelihood of segmentation point f within the segment R_(i) may be selected as a segmentation point in step 305. In step 307, the process may segment the text segment R_(i) to generate new segments Rl′, Rr′. When either of the segmented text segments Rl′, Rr′ is much smaller than the specified size S (step 308), the process may revisit the previous unsegmented text segment R_(i), select as a segmentation point the end position of the sentence that has the second best likelihood of segmentation point within the segment R_(i), and segment it accordingly (step 309).

[0040] Once segments Rl′ and Rr′, neither of which is too small relative to the specified size S, have thus been obtained, R_(i) is removed from the text segment set R, and the segments Rl′ and Rr′ are added to the text segment set R (step 311).

[0041] Then, back at step 302, the process may repeat the steps from step 305 onward for the text segments whose size exceeds the threshold Th_(size) until the size of the largest text segment among all text segments becomes smaller than the threshold Th_(size), at which point it is determined NO in step 303. By performing the text segmentation process in sequence, starting from the position having the best likelihood of segmentation point, it becomes possible to generate text segments of approximately equal size while maintaining the global topic boundaries of the text.
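The loop of FIG. 3 can be sketched in Python as follows. Segments are represented as (start, end) sentence ranges with an exclusive end, and sizes are sums of per-sentence character counts. The 1.1 threshold ratio follows the example in the text, while min_ratio (how small is "too small"), trying the third and later candidates when the second also fails, and leaving a segment unsplit when no cut is acceptable are assumptions that the text does not specify. The function name segment_text and the sample data are hypothetical.

```python
def segment_text(sent_lengths, likelihood, optimum_size,
                 threshold_ratio=1.1, min_ratio=0.3):
    """Split sentences into segments of roughly `optimum_size` characters.

    `likelihood[c]` is the likelihood of a cut after sentence c.
    Returns a sorted list of (start, end) sentence ranges, end exclusive.
    """
    threshold = threshold_ratio * optimum_size
    min_size = min_ratio * optimum_size   # assumed "too small" limit

    def size(seg):
        start, end = seg
        return sum(sent_lengths[start:end])

    pending = [(0, len(sent_lengths))]    # the set R, initially the whole text
    unsplittable = []                     # oversized segments with no valid cut
    while pending:
        largest = max(pending, key=size)             # step 302
        if size(largest) <= threshold:               # step 303: NO
            break
        pending.remove(largest)
        start, end = largest
        # Candidate cuts inside the segment, best likelihood first.
        cuts = sorted(range(start, end - 1),
                      key=lambda c: likelihood[c], reverse=True)
        for c in cuts:
            left, right = (start, c + 1), (c + 1, end)
            if size(left) >= min_size and size(right) >= min_size:
                pending.extend([left, right])        # step 311
                break
        else:
            unsplittable.append(largest)
    return sorted(pending + unsplittable)


# Hypothetical sentence lengths (characters) and likelihood values.
lengths = [100, 120, 90, 110, 80, 100]
scores = [0.2, 0.9, 0.1, 0.7, 0.3, 0.0]
print(segment_text(lengths, scores, optimum_size=200))
```

With these inputs the text is first cut after sentence index 1, and the remaining oversized segment is then cut after sentence index 3.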

[0042] Table 2 shows the group of text segments obtained when the input document D shown in Table 1 is segmented with the optimum segment size specified as 400 characters. It can be seen that the size of each segment is almost equal to 400 characters as specified. Table 3 shows an example of a text segment described in the markup language format.

TABLE 2
Text segment 1: The community of mostly volunteer programmers that has built Linux into a formidable operating system is getting some help from computer industry giants. International Business Machines Inc., Intel Corp., Hewlett-Packard Co. and NEC Corp. are announcing Wednesday that they will create a laboratory with an investment of several million dollars where programmers can test Linux software on the large computer systems that are common in the corporate world.
Text segment 2: The lab is expected to open by the end of the year near Portland, Ore. Linux is an “open source” operating system that anyone can modify, as long as the modifications are made available for free on the Internet. It has a devoted following among programmers, who collaborate on software projects over the Web. These software engineers can usually only test software on their own desktop computers, part of the reason Linux is now rarely used on larger computers.
Text segment 3: “The Open Source Development Lab will help fulfill a need that individual Linux and open source developers often have: access to high-end enterprise hardware,” said Brian Behlendorf, creator of the open source Web server software Apache. Irving Wladawsky-Berger, the head of IBM's Linux group, said the lab would help companies run hardware from different vendors together, as well as let run “clusters” of computers working as one.
Text segment 4: The four main sponsors said they will contribute several millions of dollars to the project. The lab is also backed by smaller companies that specialize in Linux products, like Red Hat Inc., Turbolinux Inc., Linuxcare Inc. and VA Linux Systems Inc., as well as Dell Computer Corp. and Silicon Graphics Inc. The founding companies said the lab will be run by a nonprofit organization that will select the software projects that gain access to the lab in an “open, neutral process.”
Text segment 5: Linux is seen as an alternative to proprietary operating systems like Microsoft's Windows and Apple OS. Its backers say the publicly available source code, or software blueprint, makes it more flexible and reliable. Analyst Bill Claybrook at Aberdeen Group said the project sponsors are backing Linux because it gives them a chance to influence an operating system for their computers. “These companies see that they can play a much more important role in developing Linux than they can in, let's say Windows, because Microsoft pretty much decides what to put in Windows,” he said.

[0043] TABLE 3
<?xml version="1.0" encoding="Shift_JIS" ?>
<text>
  <block id="0">
    <start_sentence href="input.xml#xpointer(@id=s0)"/>
    <end_sentence href="input.xml#xpointer(@id=s1)"/>
    <block_body>
      The community of mostly volunteer programmers that has built Linux into a formidable operating system is getting some help from computer industry giants. International Business Machines Inc., Intel Corp., Hewlett-Packard Co. and NEC Corp. are announcing Wednesday that they will create a laboratory with an investment of several million dollars where programmers can test Linux software on the large computer systems that are common in the corporate world.
    </block_body>
  </block>
</text>

[0044] With reference to FIG. 4, the association process for the text segments will be described. The process uses equation (1) to determine the similarity q between any pair of the text segments obtained through the above-described processes, or between the important words and any one of such text segments (step 402). When the similarity q is larger than the relevance threshold Th_(relevant) (step 403), the process may determine that those text segments contain similar topics and embed an association link between them (step 405). The relevance threshold Th_(relevant) may be, for example, 0.5. In one embodiment of the invention, the threshold Th_(relevant) may be specified by the user, to allow for situations where the user wishes to display only segments whose relativeness is high, or all of the segments as long as there is some association among them.
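The association process of FIG. 4 can be sketched in Python as follows. Each segment is reduced to term frequencies, the similarity q is computed with the standard cosine measure (the assumed reading of equation (1)), and a link is recorded when q exceeds the example threshold of 0.5. The function names, the representation of a link as an index pair and the toy segments are hypothetical.

```python
import math
from collections import Counter


def cosine_counts(a, b):
    """Cosine measure between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def association_links(segments, terms, threshold=0.5):
    """Return index pairs (i, j) of segments whose similarity exceeds threshold."""
    vectors = [Counter(w for w in seg if w in terms) for seg in segments]
    links = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_counts(vectors[i], vectors[j]) > threshold:  # step 403
                links.append((i, j))                               # step 405
    return links


# Hypothetical segments, each given as a list of words.
segments = [["linux", "lab", "linux"],
            ["windows", "microsoft"],
            ["lab", "linux", "software"]]
terms = {"linux", "lab", "windows", "software"}
print(association_links(segments, terms))
```

Raising the threshold keeps only strongly related pairs, while lowering it toward zero links any two segments that share at least one term.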

[0045] Hyperlinks between text segments that are similar in terms of topics may be embedded into the corresponding text segments by means of the markup language. The target of a hyperlink is not limited to one text segment; a hyperlink may point to plural text segments. By employing, for example, the XPointer of XML as a markup language, links to a plurality of text segments can be constructed, enabling a mechanism for displaying a plurality of associated segments from one text segment to be implemented on the browser.

[0046] It should be particularly noted that the invention as described above with reference to the specific embodiments is not intended only for English text; it is applicable to text in any other language, including Japanese, through equivalent processes, on condition that the morphological analysis is performed for that language.

[0047] In accordance with the invention, a text is segmented into smaller segments having almost the same size as the specified one, so the text can be displayed efficiently to the user even on a small screen such as that of a mobile terminal. In particular, the user will be able to determine at a glance whether the text being displayed is required or not, because the text segments can be generated so as to fit the screen size. In one embodiment, it is further possible for the user to scroll the text on a text segment basis when displaying the text, because the text segments can be generated so as to fit the screen.

[0048] In accordance with one embodiment of the invention, the association between the text segments having the same topics is established, so it is possible for users to access another associated text segment easily. Besides, a terminal display device does not need large storage because it can display the text on a segment basis instead of displaying the whole text. Furthermore, since the text can be transmitted segment by segment to the terminal display, limitations on the transmission packet size and/or the hardware can be taken into consideration at transmission time. Also, the user can immediately read the required portion of the text when the search result is presented as text segments.

[0049] Furthermore, because coherent units are represented by text segments that are automatically extracted in accordance with the invention, the user may extract important words or sentences for each segment in accordance with known methods disclosed in the literature (for example, M. Kameda, “Retrieval of important sentences based on the paragraph shifting method using relativeness between the paragraph and the sentence”, Natural Language Process Study Group Report, Information Process Society, 1997, 119-126.121-17), or generate a text summary for each segment in accordance with known methods disclosed in the literature (for example, Y. Nakao, “Summary generation based on automatic recognition about the coherent, hierarchical structure of the text”, Language Process Society, the 4th annual papers on “the today and future about the text summary”, 1998, 72-79). The results can be presented on the display screen so that users can easily and quickly read and understand the summary of the text.

[0050] Although the invention has been described with reference to the specific embodiments, the invention is not intended to be limited to those embodiments.

What is claimed is:
 1. A text segmentation apparatus comprising: meansfor analyzing an electronic text to determine likelihood of segmentationpoint for each of sentence ends in said text based on a coherent unit;and means for segmenting said text into text segments based on saidlikelihood of segmentation point and a specified text segmentation size.2. A text segmentation apparatus comprising: means for analyzing anelectronic text to determine likelihood of segmentation point for eachof sentence ends in said text based on a coherent unit; and means forsegmenting said text into text segments based on said likelihood ofsegmentation point, wherein when the size of any of said segmented textsegments exceeds a threshold value to be determined based on a specifiedtext segmentation size, said text segmentation apparatus is programmedto segment said text segment at the position having best likelihood ofsegmentation point within said text segment.
 3. The text segmentationapparatus as claimed in claim 2, further comprising means for setting upa pair of windows, each having a predetermined window size, on both leftand right sides of each of said sentence ends in said text, fordetermining similarity of terms contained in said left and rightwindows, wherein said means for analyzing determines the likelihood ofsegmentation point based on said similarity.
 4. The text segmentation apparatus as claimed in claim 3, wherein said means for setting up determines an overall likelihood of segmentation point F(c) based on a plurality of likelihood of segmentation point f(c) each of which is determined respectively for each of a number (L) of different window sizes, where c represents a respective sentence end position.
 5. The text segmentation apparatus as claimed in claim 2, wherein when the size of any of said segmented text segments is smaller to a predetermined degree than the specified text segmentation size, said apparatus is programmed to revisit the previous unsegmented text segment so as to segment said text at the position having second best likelihood of segmentation point.
 6. The text segmentation apparatus as claimed in claim 2, wherein said apparatus is programmed to determine the similarity between the segmented text segments and form association links on the text segments if their determined similarity exceeds a predetermined threshold value.
 7. The text segmentation apparatus as claimed in claim 6, wherein said text segments are formatted using a markup language, and said association links are embedded in said text segments using said markup language.
 8. A display device receiving the segmented text segments from the text segmentation apparatus as claimed in claim 2 for displaying said text segments in sequence in terms of their association.
 9. The display device as claimed in claim 8, wherein said segmented text segments are associated with each other based on the similarity between the text segments or the global text structure and then they are displayed in sequence of such association.
 10. The text segmentation apparatus as claimed in claim 2, wherein said specified size is determined in accordance with the characteristics of the display device for displaying said text segments.
 11. A text segmentation method comprising the steps of: analyzing an electronic text to determine likelihood of segmentation point for each of sentence ends in said text based on a coherent unit; and segmenting said text into text segments based on said likelihood of segmentation point, wherein when the size of any of said segmented text segments exceeds a threshold value to be determined based on a specified text segmentation size, said text segment is segmented at the position having best likelihood of segmentation point within said text segment.
 12. The text segmentation method as claimed in claim 11, further comprising a step for setting up a pair of windows, each having a predetermined window size, on both left and right sides of each of said sentence ends in said text and for determining the similarity of terms contained in said left and right windows, wherein said step for analyzing determines the likelihood of segmentation point based on said similarity.
 13. The text segmentation method as claimed in claim 12, wherein said step for setting up determines an overall likelihood of segmentation points F(c) based on a plurality of likelihood of segmentation points f(c) each of which is determined respectively for each of a number (L) of different window sizes where c represents respective sentence end positions.
 14. The text segmentation method as claimed in claim 11, further comprising a step for revisiting previous unsegmented text segment so as to segment said text at the position having second best likelihood of segmentation point when the size of any of said segmented text segments is smaller to a predetermined degree than the specified text segmentation size.
 15. The text segmentation method as claimed in claim 11, further comprising a step for determining the similarity between the segmented text segments and forming association links on the text segments if their determined similarity exceeds a predetermined threshold value.
 16. The text segmentation method as claimed in claim 15, wherein said text segments are formatted using a markup language, and said association links are embedded in said text segments using said markup language.
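
For illustration only, the steps recited in claims 11 and 12 may be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the claimed implementation: the function names (cosine, likelihoods, segment), the use of cosine similarity over word counts, the definition of likelihood as one minus similarity, the measurement of segment size in words, and the use of the specified size directly as the threshold value are all hypothetical choices made for this example.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def likelihoods(sentences, window=2):
    """Score each sentence end c (the boundary before sentence c).

    A pair of windows of `window` sentences is set up on the left and
    right of the boundary. Assumption: likelihood = 1 - similarity.
    """
    bags = [Counter(s.lower().split()) for s in sentences]
    scores = {}
    for c in range(1, len(sentences)):
        left = sum(bags[max(0, c - window):c], Counter())
        right = sum(bags[c:c + window], Counter())
        scores[c] = 1.0 - cosine(left, right)
    return scores

def segment(sentences, scores, max_words, start=0, end=None):
    """Split [start, end) at its best-scoring boundary while it is too large.

    Assumption: the threshold value equals max_words, counted in words.
    Returns a list of (start, end) sentence index pairs.
    """
    if end is None:
        end = len(sentences)
    size = sum(len(s.split()) for s in sentences[start:end])
    candidates = range(start + 1, end)
    if size <= max_words or not candidates:
        return [(start, end)]
    best = max(candidates, key=lambda c: scores[c])
    return (segment(sentences, scores, max_words, start, best)
            + segment(sentences, scores, max_words, best, end))

sentences = [
    "cats purr softly",
    "cats sleep often",
    "stocks fell sharply",
    "stocks rose later",
]
scores = likelihoods(sentences, window=2)
result = segment(sentences, scores, max_words=6)
print(result)  # -> [(0, 2), (2, 4)]
```

In this example the boundary between the second and third sentences shares no terms across its windows, so it receives the highest likelihood and is chosen first; both resulting segments then fall within the assumed six-word threshold and are not split further.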